Data-As-Material

Introduction

I am working with data summaries: first the mpg dataset, then the Math Anxiety data, and finally the Star Trek Books data.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggformula)

Loading required package: scales

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Loading required package: ggridges

New to ggformula?  Try the tutorials: 
    learnr::run_tutorial("introduction", package = "ggformula")
    learnr::run_tutorial("refining", package = "ggformula")

library(mosaic)

Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following object is masked from 'package:scales':

    rescale

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum

library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

library(skimr)


Attaching package: 'skimr'

The following object is masked from 'package:mosaic':

    n_missing

Look at the mpg dataset

mpg

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

The mpg dataset has 234 rows and 11 columns describing car models from 1999 and 2008: manufacturer, model, engine displacement, model year, cylinders, transmission, drivetrain, city and highway mileage, fuel type, and vehicle class.

First 10 rows of the mpg dataset

mpg %>%
  head(10)

# A tibble: 10 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…

The table displays the first 10 rows of the mpg dataset, with details on engine size (displacement), number of cylinders, transmission type (automatic or manual), drivetrain (front-wheel or four-wheel drive), and fuel efficiency in city and highway miles per gallon (MPG). In these rows, engine displacement ranges from 1.8 to 3.1 liters, and the number of cylinders is either 4 or 6. The city MPG varies from 16 to 21, while highway MPG ranges from 25 to 31.

Glimpse - mpg dataset

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

This summary provides an overview of the specifications and performance characteristics of car models in the dataset.

Inspect - mpg dataset

inspect(mpg)


categorical variables:  
          name     class levels   n missing
1 manufacturer character     15 234       0
2        model character     38 234       0
3        trans character     10 234       0
4          drv character      3 234       0
5           fl character      5 234       0
6        class character      7 234       0
                                   distribution
1 dodge (15.8%), toyota (14.5%) ...            
2 caravan 2wd (4.7%) ...                       
3 auto(l4) (35.5%), manual(m5) (24.8%) ...     
4 f (45.3%), 4 (44%), r (10.7%)                
5 r (71.8%), p (22.2%), e (3.4%) ...           
6 suv (26.5%), compact (20.1%) ...             

quantitative variables:  
   name   class    min     Q1 median     Q3  max        mean       sd   n
1 displ numeric    1.6    2.4    3.3    4.6    7    3.471795 1.291959 234
2  year integer 1999.0 1999.0 2003.5 2008.0 2008 2003.500000 4.509646 234
3   cyl integer    4.0    4.0    6.0    8.0    8    5.888889 1.611534 234
4   cty integer    9.0   14.0   17.0   19.0   35   16.858974 4.255946 234
5   hwy integer   12.0   18.0   24.0   27.0   44   23.440171 5.954643 234
  missing
1       0
2       0
3       0
4       0
5       0

The inspection of the mpg dataset reveals two types of variables: categorical and quantitative. Categorical variables include manufacturer, model, transmission, drivetrain, fl (fuel type), and class, with a total of 234 entries and no missing data. Quantitative variables, such as engine displacement, year, number of cylinders, city miles per gallon, and highway miles per gallon, are summarized with key statistics. For example, engine displacement ranges from 1.6 to 7 liters, and city MPG varies from 9 to 35, with an average of 16.86. This overview highlights the structure and details of the dataset, providing both descriptive and numerical insights.

Skim - mpg dataset

skim(mpg)

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
manufacturer	1	4	10	15
model	1	2	22	38
trans	1	8	10	10
drv	1	1	1	3
fl	1	1	1	5
class	1	3	10	7

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
displ	1	3.47	1.29	1.6	2.4	3.3	4.6	7	▇▆▆▃▁
year	1	2003.50	4.51	1999.0	1999.0	2003.5	2008.0	2008	▇▁▁▁▇
cyl	1	5.89	1.61	4.0	4.0	6.0	8.0	8	▇▁▇▁▇
cty	1	16.86	4.26	9.0	14.0	17.0	19.0	35	▆▇▃▁▁
hwy	1	23.44	5.95	12.0	18.0	24.0	27.0	44	▅▅▇▁▁

All variables are complete, with no missing values (complete_rate is 1 for every column), so the dataset is ready for further analysis.

Data Dictionary

Quantitative Data

Engine Displacement (dbl): The engine size in liters.
Model Year (int): The year of the car’s model, either 1999 or 2008.
City Mileage (int): Miles per gallon (MPG) in city driving conditions.
Highway Mileage (int): Miles per gallon (MPG) in highway driving conditions.

Qualitative Data

Manufacturer (chr): The car’s manufacturer, e.g., Audi, Toyota.
Model (chr): The specific car model, e.g., A4, Corolla.
Transmission (chr): The type of transmission, e.g., auto (automatic), manual (m5/m6).
Drivetrain (chr): The type of drivetrain, e.g., f (front-wheel drive), 4 (four-wheel drive).
Fuel (chr): The type of fuel used, e.g., p (premium), r (regular).
Class of Vehicle (chr): The category of the vehicle, e.g., compact, SUV.
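Cylinders (int): The number of cylinders in the engine (4, 5, 6, or 8). Stored as an integer, but treated as a category and converted to a factor in the Data Munging step.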

Data Munging

mpg_modified <- mpg %>%
  dplyr::mutate(
    cyl = as_factor(cyl),
    fl = as_factor(fl),
    drv = as_factor(drv),
    class = as_factor(class),
    trans = as_factor(trans)
  )
glimpse(mpg_modified)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <fct> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
$ drv          <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
$ class        <fct> compact, compact, compact, compact, compact, compact, com…

Here, five variables (cyl, fl, drv, class, and trans) have been converted from their original integer or character types to factors. This makes them explicit categorical variables, which suits analyses involving groupings or classifications.
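The same conversion can be written more compactly with across(), which applies one function to several columns at once. This is a minimal sketch rather than the method used above; the name mpg_factors is hypothetical, and the data is taken from ggplot2::mpg so the snippet runs on its own.

```r
library(dplyr)
library(forcats)

# Convert the five grouping columns to factors in a single mutate() call.
# mpg_factors is an illustrative name, not an object from the text above.
mpg_factors <- ggplot2::mpg %>%
  mutate(across(c(cyl, fl, drv, class, trans), as_factor))

# Inspect the resulting levels of one converted column
levels(mpg_factors$cyl)
```

Listing the columns once inside across() avoids repeating as_factor() for each variable and makes it harder to miss one.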

Average Highway MPG grouped by the number of cylinders

mpg_modified %>%
  group_by(cyl) %>%
  summarize(average_hwy = mean(hwy), count = n())

# A tibble: 4 × 3
  cyl   average_hwy count
  <fct>       <dbl> <int>
1 4            28.8    81
2 5            28.8     4
3 6            22.8    79
4 8            17.6    70

The table summarizes the average highway miles per gallon for cars grouped by the number of cylinders. Cars with 4 cylinders have the highest average highway MPG at 28.80, followed closely by 5-cylinder cars with 28.75, though the 5-cylinder group only includes 4 cars. Cars with 6 cylinders average 22.82 MPG, while cars with 8 cylinders have the lowest fuel efficiency, averaging 17.63 MPG. Overall, the data shows that vehicles with fewer cylinders tend to be more fuel-efficient on the highway, with MPG decreasing as the number of cylinders increases.

Average City MPG grouped by the number of cylinders

mpg_modified %>%
  group_by(cyl) %>%
  summarize(average_cty = mean(cty), count = n())

# A tibble: 4 × 3
  cyl   average_cty count
  <fct>       <dbl> <int>
1 4            21.0    81
2 5            20.5     4
3 6            16.2    79
4 8            12.6    70

The table summarizes the average city miles per gallon based on the number of cylinders. Cars with 4 cylinders have the highest average city MPG at 21.01, followed by 5-cylinder cars at 20.50, though the sample size for 5-cylinder cars is small with only 4 cars. Vehicles with 6 cylinders average 16.22 MPG, while those with 8 cylinders have the lowest city MPG at 12.57. This data shows that cars with fewer cylinders tend to have better fuel efficiency in city driving.
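The highway and city averages can also be computed in one pass, with each column named after the measure it averages. The sketch below assumes only dplyr and the ggplot2::mpg data; mpg_by_cyl is a hypothetical name.

```r
library(dplyr)

# One summary table holding both averages per cylinder count.
# mpg_by_cyl is an illustrative name, not an object from the text above.
mpg_by_cyl <- ggplot2::mpg %>%
  group_by(cyl) %>%
  summarize(
    average_hwy = mean(hwy),  # highway miles per gallon
    average_cty = mean(cty),  # city miles per gallon
    count = n()
  )

mpg_by_cyl
```

Keeping both measures side by side makes the highway versus city gap visible for each cylinder count without comparing two separate tables.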

Average Highway MPG grouped by the number of cylinders and fuel type

mpg_modified %>%
  group_by(cyl, fl) %>%
  summarize(average_hwy = mean(hwy), count = n())

`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.

# A tibble: 13 × 4
# Groups:   cyl [4]
   cyl   fl    average_hwy count
   <fct> <fct>       <dbl> <int>
 1 4     p            27.8    22
 2 4     r            28.3    55
 3 4     d            43       3
 4 4     c            36       1
 5 5     r            28.8     4
 6 6     p            25.3    17
 7 6     r            22.2    60
 8 6     e            17       1
 9 6     d            22       1
10 8     p            20.8    13
11 8     r            17.5    49
12 8     e            12.7     7
13 8     d            17       1

The table shows the average highway MPG for cars based on cylinders and fuel type. Cars with 4 cylinders are the most fuel-efficient, with MPG ranging from 27.82 for premium fuel to 43 for diesel, though the alternative fuel samples are small. For 6-cylinder cars, MPG falls between 25.29 (premium) and 17 (ethanol). 8-cylinder cars are the least efficient, with 17.51 MPG for regular fuel and 12.71 for ethanol. Overall, cars with fewer cylinders and certain fuels, like diesel, achieve better highway fuel efficiency.
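The message about grouped output appears because summarize() drops only the last grouping variable by default, so the result is still grouped by cyl. As a sketch, passing .groups = "drop" returns an ungrouped table and silences the message; hwy_by_cyl_fl is a hypothetical name and the data comes from ggplot2::mpg.

```r
library(dplyr)

# Same summary as above, but returned as an ungrouped tibble.
# hwy_by_cyl_fl is an illustrative name, not an object from the text above.
hwy_by_cyl_fl <- ggplot2::mpg %>%
  group_by(cyl, fl) %>%
  summarize(average_hwy = mean(hwy), count = n(), .groups = "drop")

# Check that no grouping remains on the result
is_grouped_df(hwy_by_cyl_fl)
```

An ungrouped result is usually the safer default when the table is passed on to further steps, since leftover grouping can silently change how later verbs behave.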

Average City MPG grouped by the number of cylinders and fuel type

mpg_modified %>%
  group_by(cyl, fl) %>%
  summarize(average_cty = mean(cty), count = n())

`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.

# A tibble: 13 × 4
# Groups:   cyl [4]
   cyl   fl    average_cty count
   <fct> <fct>       <dbl> <int>
 1 4     p           19.9     22
 2 4     r           20.8     55
 3 4     d           32.3      3
 4 4     c           24        1
 5 5     r           20.5      4
 6 6     p           16.8     17
 7 6     r           16.1     60
 8 6     e           11        1
 9 6     d           17        1
10 8     p           13.8     13
11 8     r           12.7     49
12 8     e            9.57     7
13 8     d           14        1

The table breaks down the average city miles per gallon by cylinder count and fuel type (fl). Among 4-cylinder cars, diesel achieves the highest average at 32.33 MPG, followed by compressed natural gas (24 MPG) and regular fuel (20.78 MPG). 6-cylinder cars show lower MPG, with the single ethanol-fueled car lowest at 11 MPG. Among 8-cylinder cars, premium fuel averages 13.77 MPG, regular 12.7 MPG, and ethanol the lowest at 9.57 MPG; the single 8-cylinder diesel car gets 14 MPG.

Average Highway MPG for different car manufacturers

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(mean_mileage_manf=mean(hwy))

# A tibble: 15 × 2
   manufacturer mean_mileage_manf
   <chr>                    <dbl>
 1 audi                      26.4
 2 chevrolet                 21.9
 3 dodge                     17.9
 4 ford                      19.4
 5 honda                     32.6
 6 hyundai                   26.9
 7 jeep                      17.6
 8 land rover                16.5
 9 lincoln                   17  
10 mercury                   18  
11 nissan                    24.6
12 pontiac                   26.4
13 subaru                    25.6
14 toyota                    24.9
15 volkswagen                29.2

The table displays the average highway miles per gallon for each manufacturer. Honda leads with the highest average highway MPG at 32.6, followed by Volkswagen (29.2), Hyundai (26.9), and Audi and Pontiac (26.4 each). Subaru, Toyota, and Nissan also show relatively high fuel efficiency, with averages above 24 MPG. In contrast, Land Rover (16.5), Lincoln (17.0), Jeep (17.6), and Dodge (17.9) have the lowest averages.

Average City MPG for different car manufacturers

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(mean_mileage_manf=mean(cty))

# A tibble: 15 × 2
   manufacturer mean_mileage_manf
   <chr>                    <dbl>
 1 audi                      17.6
 2 chevrolet                 15  
 3 dodge                     13.1
 4 ford                      14  
 5 honda                     24.4
 6 hyundai                   18.6
 7 jeep                      13.5
 8 land rover                11.5
 9 lincoln                   11.3
10 mercury                   13.2
11 nissan                    18.1
12 pontiac                   17  
13 subaru                    19.3
14 toyota                    18.5
15 volkswagen                20.9

The table displays the average city miles per gallon for each manufacturer. Honda leads with the highest average city MPG at 24.44, followed by Volkswagen (20.93) and Subaru (19.29). Hyundai (18.6), Toyota (18.53), Nissan (18.08), and Audi (17.61) show moderate fuel efficiency. In contrast, Lincoln (11.3) and Land Rover (11.5) have the lowest city MPG, with Dodge, Mercury, and Jeep close behind at 13.1 to 13.5.
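Because the manufacturer tables are printed in alphabetical order, the leaders and laggards are easy to misread. One option, sketched here with the hypothetical name hwy_ranked and the ggplot2::mpg data, is to sort the summary by the average so the ranking can be read from top to bottom.

```r
library(dplyr)

# Average highway MPG per manufacturer, sorted from highest to lowest.
# hwy_ranked is an illustrative name, not an object from the text above.
hwy_ranked <- ggplot2::mpg %>%
  group_by(manufacturer) %>%
  summarize(mean_mileage_manf = mean(hwy)) %>%
  arrange(desc(mean_mileage_manf))

head(hwy_ranked, 3)  # top three manufacturers
tail(hwy_ranked, 3)  # bottom three manufacturers
```

Replacing hwy with cty gives the corresponding city ranking.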

Math Anxiety Data

I am working with Math Anxiety Data.

math_anxiety <- read_csv("../../data/MathAnxiety.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 599 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Age;Gender;Grade;AMAS;RCMAS;Arith

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

math_anxiety

# A tibble: 599 × 1
   `Age;Gender;Grade;AMAS;RCMAS;Arith`
   <chr>                              
 1 137,8;Boy;Secondary;9;20;6         
 2 140,7;Boy;Secondary;18;8;6         
 3 137,9;Girl;Secondary;23;26;5       
 4 142,8;Girl;Secondary;19;18;7       
 5 135,6;Boy;Secondary;23;20;1        
 6 135,0;Girl;Secondary;27;33;1       
 7 133,6;Boy;Secondary;22;23;4        
 8 139,3;Boy;Secondary;17;11;7        
 9 131,7;Girl;Secondary;28;32;2       
10 134,8;Boy;Secondary;20;30;6        
# ℹ 589 more rows

read_csv() assumes comma delimiters, so it reads all 599 rows into a single column named Age;Gender;Grade;AMAS;RCMAS;Arith. The file is actually semicolon-delimited and writes Age with a decimal comma (for example, 137,8). Each row represents one student, with six variables: Age, Gender, Grade, AMAS (Abbreviated Math Anxiety Scale), RCMAS (Revised Children’s Manifest Anxiety Scale), and Arith (arithmetic score).

Specifying Delimiter

math_anxiety <- read_delim(file="../../data/MathAnxiety.csv",delim =";")

Rows: 599 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (2): Gender, Grade
dbl (3): AMAS, RCMAS, Arith
num (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

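With a semicolon (;) delimiter the file splits into six columns. Note that Age is parsed as a number with the comma treated as a grouping mark, so a raw value such as 137,8 becomes 1378.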
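An alternative is to tell readr that the comma is a decimal mark, so that a raw value like 137,8 is read as 137.8 instead of 1378. The sketch below is not the approach used in this document: it parses two toy rows copied from the printout above rather than the real file, the names toy_csv and toy are hypothetical, and the reading of Age as months is an assumption inferred from the magnitudes.

```r
library(readr)

# Two toy rows in the same layout as the file; I() marks the string as literal data.
# toy_csv and toy are illustrative names, not objects from the text above.
toy_csv <- I(paste(
  "Age;Gender;Grade;AMAS;RCMAS;Arith",
  "137,8;Boy;Secondary;9;20;6",
  "140,7;Boy;Secondary;18;8;6",
  sep = "\n"
))

toy <- read_delim(
  toy_csv,
  delim = ";",
  locale = locale(decimal_mark = ","),  # treat "," as the decimal mark
  col_types = cols(
    Age = col_double(),
    Gender = col_character(),
    Grade = col_character(),
    AMAS = col_double(),
    RCMAS = col_double(),
    Arith = col_double()
  )
)

toy$Age       # Age as written in the file, with the decimal restored
toy$Age / 12  # age in years, assuming Age is recorded in months
```

readr also provides read_csv2(), which uses a semicolon delimiter and a decimal comma by default and may be a shorter route for files like this one.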

First 10 rows of the Math Anxiety dataset

math_anxiety %>% 
  head(10)

# A tibble: 10 × 6
     Age Gender Grade      AMAS RCMAS Arith
   <dbl> <chr>  <chr>     <dbl> <dbl> <dbl>
 1  1378 Boy    Secondary     9    20     6
 2  1407 Boy    Secondary    18     8     6
 3  1379 Girl   Secondary    23    26     5
 4  1428 Girl   Secondary    19    18     7
 5  1356 Boy    Secondary    23    20     1
 6  1350 Girl   Secondary    27    33     1
 7  1336 Boy    Secondary    22    23     4
 8  1393 Boy    Secondary    17    11     7
 9  1317 Girl   Secondary    28    32     2
10  1348 Boy    Secondary    20    30     6

The table displays the first 10 rows of the Math Anxiety dataset, which consists of 6 variables: Age, Gender, Grade, AMAS (Abbreviated Math Anxiety Scale), RCMAS (Revised Children’s Manifest Anxiety Scale), and Arithmetic scores. The scores in AMAS, RCMAS, and Arithmetic vary across the students, showcasing different levels of math anxiety and performance. For instance, one student has an AMAS score of 9 and RCMAS score of 20, while another has an AMAS score of 28 and an RCMAS score of 32, indicating variability in anxiety levels among students.

Glimpse - math_anxiety

glimpse(math_anxiety)

Rows: 599
Columns: 6
$ Age    <dbl> 1378, 1407, 1379, 1428, 1356, 1350, 1336, 1393, 1317, 1348, 141…
$ Gender <chr> "Boy", "Boy", "Girl", "Girl", "Boy", "Girl", "Boy", "Boy", "Gir…
$ Grade  <chr> "Secondary", "Secondary", "Secondary", "Secondary", "Secondary"…
$ AMAS   <dbl> 9, 18, 23, 19, 23, 27, 22, 17, 28, 20, 16, 20, 21, 36, 16, 27, …
$ RCMAS  <dbl> 20, 8, 26, 18, 20, 33, 23, 11, 32, 30, 10, 4, 23, 26, 24, 21, 3…
$ Arith  <dbl> 6, 6, 5, 7, 1, 1, 4, 7, 2, 6, 2, 5, 2, 6, 2, 7, 2, 4, 7, 3, 8, …

The glimpse provides a quick overview of the structure and types of data within the dataset.

Inspect - math_anxiety

inspect(math_anxiety)


categorical variables:  
    name     class levels   n missing
1 Gender character      2 599       0
2  Grade character      2 599       0
                                   distribution
1 Boy (53.9%), Girl (46.1%)                    
2 Primary (66.9%), Secondary (33.1%)           

quantitative variables:  
   name   class min     Q1 median     Q3  max       mean         sd   n missing
1   Age numeric  37 1061.5   1208 1418.5 1875 1246.49249 223.112183 599       0
2  AMAS numeric   4   18.0     22   26.5   45   21.98164   6.597962 599       0
3 RCMAS numeric   1   14.0     19   25.0   41   19.24040   7.566802 599       0
4 Arith numeric   0    4.0      6    7.0    8    5.30217   2.105220 599       0

The summary of the Math Anxiety dataset, based on the inspect() function, reveals both categorical and quantitative variables. The categorical variables are Gender and Grade, with 53.9% of the entries labeled as Boy and 46.1% as Girl. In terms of Grade, 66.9% belong to the Primary level, while 33.1% are from the Secondary level. The quantitative variables are Age, AMAS (Abbreviated Math Anxiety Scale), RCMAS (Revised Children’s Manifest Anxiety Scale), and Arith (arithmetic score). The imported Age values range from 37 to 1875, with a mean of 1246.49 and a standard deviation of 223.11; these are raw values in which the decimal comma was dropped on import, not ages in years. AMAS scores range from 4 to 45, with a mean of 21.98 and a standard deviation of 6.60. For RCMAS, the range is 1 to 41, with a mean of 19.24 and a standard deviation of 7.57. Finally, Arith scores vary from 0 to 8, with a mean of 5.30 and a standard deviation of 2.11.

Skim - math_anxiety

skim(math_anxiety)

Data summary
Name	math_anxiety
Number of rows	599
Number of columns	6
_______________________
Column type frequency:
character	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Gender	0	1	3	4	0	2	0
Grade	0	1	7	9	0	2	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Age	1	1246.49	223.11	37	1061.5	1208	1418.5	1875	▁▁▇▇▃
AMAS	1	21.98	6.60	4	18.0	22	26.5	45	▂▆▇▃▁
RCMAS	1	19.24	7.57	1	14.0	19	25.0	41	▂▇▇▅▁
Arith	1	5.30	2.11	0	4.0	6	7.0	8	▂▃▃▇▇

The skim confirms two character variables and four numeric variables, all complete with no missing data.

Data Dictionary

Quantitative Data

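Age (dbl): The age of the participant. The raw file writes it with a decimal comma (e.g., 137,8), apparently in months; it is converted to years in the Data Munging step.
AMAS (dbl): Abbreviated Math Anxiety Scale (AMAS) score, indicating the level of math anxiety.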
RCMAS (dbl): Revised Children’s Manifest Anxiety Scale (RCMAS) score, measuring general anxiety.
Arith (dbl): Arithmetic test score, indicating performance in a mathematics test.

Qualitative Data

Gender (chr): The gender of the participant, either Boy or Girl.
Grade (chr): The educational level of the participant, either Primary or Secondary.

Data Munging

math_anxiety_modified <- math_anxiety %>%
  dplyr::mutate(
    Age = Age/120,
    Gender = as_factor(Gender)
  )
math_anxiety_modified

# A tibble: 599 × 6
     Age Gender Grade      AMAS RCMAS Arith
   <dbl> <fct>  <chr>     <dbl> <dbl> <dbl>
 1  11.5 Boy    Secondary     9    20     6
 2  11.7 Boy    Secondary    18     8     6
 3  11.5 Girl   Secondary    23    26     5
 4  11.9 Girl   Secondary    19    18     7
 5  11.3 Boy    Secondary    23    20     1
 6  11.2 Girl   Secondary    27    33     1
 7  11.1 Boy    Secondary    22    23     4
 8  11.6 Boy    Secondary    17    11     7
 9  11.0 Girl   Secondary    28    32     2
10  11.2 Boy    Secondary    20    30     6
# ℹ 589 more rows

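In this transformation, Age is divided by 120 to convert it to years: because the decimal comma was dropped on import, a value such as 1378 stands for 137.8, and 1378 / 120 ≈ 11.5 years. The raw minimum of 37 would map to about 0.3 years, so at least one record looks mis-scaled or mis-recorded and is worth checking. The Gender column has been converted into a factor with two levels: “Boy” and “Girl.”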

Summary of Average AMAS Scores and Count by Gender

math_anxiety_modified %>%
  group_by(Gender) %>%
  summarize(average_AMAS = mean(AMAS), count = n())

# A tibble: 2 × 3
  Gender average_AMAS count
  <fct>         <dbl> <int>
1 Boy            21.2   323
2 Girl           22.9   276

The summary of average AMAS scores grouped by gender reveals that girls have a slightly higher average AMAS score (22.93) compared to boys (21.17). The total count of boys in the dataset is 323, while the total count of girls is 276. This suggests that while boys and girls show close levels of math anxiety, girls exhibit a marginally higher average score in the dataset.

Summary of Average AMAS Scores and Count by Gender and Age Group

math_anxiety_modified %>%
  group_by(Gender,Age) %>%
  summarize(average_AMAS = mean(AMAS), count = n())

`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.

# A tibble: 474 × 4
# Groups:   Gender [2]
   Gender   Age average_AMAS count
   <fct>  <dbl>        <dbl> <int>
 1 Boy     7.76         16       1
 2 Boy     7.82         24       1
 3 Boy     7.83         13       1
 4 Boy     7.9          14       1
 5 Boy     7.91         11       1
 6 Boy     7.94         29       1
 7 Boy     7.96         16.5     2
 8 Boy     7.98          9       1
 9 Boy     7.98         29       1
10 Boy     7.99          9       1
# ℹ 464 more rows

Grouping by Gender and exact Age produces 474 groups, most of which contain a single student, so these averages are essentially individual scores rather than age-group summaries. To see how math anxiety varies with age for boys and girls, Age would first need to be binned into age groups (for example, whole years).
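One way to get real age groups is to round Age down to whole years before grouping. The sketch below uses a small made-up table in place of math_anxiety_modified, so the numbers are illustrative only; toy_anxiety, age_group, and amas_by_age_group are hypothetical names.

```r
library(dplyr)

# Made-up stand-in for math_anxiety_modified, with Age already in years.
# The values are illustrative, not taken from the real dataset.
toy_anxiety <- tibble(
  Age    = c(7.76, 7.96, 8.40, 11.5, 11.7, 11.9),
  Gender = factor(c("Boy", "Boy", "Girl", "Boy", "Girl", "Girl")),
  AMAS   = c(16, 17, 20, 9, 23, 19)
)

amas_by_age_group <- toy_anxiety %>%
  mutate(age_group = floor(Age)) %>%  # whole years as age groups
  group_by(Gender, age_group) %>%
  summarize(average_AMAS = mean(AMAS), count = n(), .groups = "drop")

amas_by_age_group
```

Whole years are only one possible binning; wider bands defined with cut() would give larger groups and steadier averages.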

Summary of Average RCMAS Scores and Count by Gender

math_anxiety_modified %>%
  group_by(Gender) %>%
  summarize(average_RCMAS = mean(RCMAS), count = n())

# A tibble: 2 × 3
  Gender average_RCMAS count
  <fct>          <dbl> <int>
1 Boy             18.1   323
2 Girl            20.6   276

The table shows a summary of average RCMAS (Revised Children’s Manifest Anxiety Scale) scores grouped by gender. The average RCMAS score for boys is 18.12, based on 323 participants, while the average RCMAS score for girls is 20.55, based on 276 participants. This suggests that, on average, girls exhibit slightly higher anxiety levels as measured by the RCMAS scale compared to boys.

Summary of Average RCMAS Scores and Count by Gender and Age Group

math_anxiety_modified %>%
  group_by(Gender,Age) %>%
  summarize(average_RCMAS = mean(RCMAS), count = n())

`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.

# A tibble: 474 × 4
# Groups:   Gender [2]
   Gender   Age average_RCMAS count
   <fct>  <dbl>         <dbl> <int>
 1 Boy     7.76            20     1
 2 Boy     7.82            20     1
 3 Boy     7.83            15     1
 4 Boy     7.9             17     1
 5 Boy     7.91            15     1
 6 Boy     7.94            24     1
 7 Boy     7.96            27     2
 8 Boy     7.98            23     1
 9 Boy     7.98            29     1
10 Boy     7.99            34     1
# ℹ 464 more rows

As with the AMAS table, grouping by exact Age yields 474 groups that mostly contain a single student, so this table lists near-individual RCMAS scores. Comparing anxiety across age groups within each gender would require binning Age first.

Star Trek Books Data

I am working with Star Trek Books Data.

startrek_data <- read_csv("../../data/star_trek_books.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 16369 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): title;author;date;publisher;identifier;series;subseries;nchap;nword...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

startrek_data

# A tibble: 16,369 × 1
   title;author;date;publisher;identifier;series;subseries;nchap;nword;nchar;d…¹
   <chr>                                                                        
 1 "Star Trek: Star Trek Movie Tie-In;Alan Dean Foster;2009-05-12;Simon and Sch…
 2 "Starfleet Academy: The Delta Anomaly;Rick Barba;2010-11-02;Simon Spotlight;…
 3 "Starfleet Academy: The Edge;Rudy Josephs;2010-12-28;Simon Spotlight;9781442…
 4 "Starfleet Academy: The Gemini Agent;Rick Barba;2011-06-28;Simon Spotlight;9…
 5 "Starfleet Academy: The Assassination Game;Alan Gratz;2012-06-26;Simon Spotl…
 6 "Star Trek: Into Darkness;Alan Dean Foster;2013-05-21;Gallery Books;97814767…
 7 "Captain's Table 1: War Dragons;James T. Hirk;1998-06-01;Pocket Books;978143…
 8 "Captain's Table 5: Once Burned;Mackenzie;1998-10-01;Pocket Books;9780743455…
 9 "Captain's Table 6: Where Sea Meets Sky;Christopher Pike;1998-10-01;Pocket B…
10 "For my brother, Ray, who introduced me to Star Trek and helped tune it in b…
# ℹ 16,359 more rows
# ℹ abbreviated name:
#   ¹`title;author;date;publisher;identifier;series;subseries;nchap;nword;nchar;dedication`

Read with read_csv(), which assumes comma delimiters, the file comes in as a single column with 16,369 rows. That row count is an artifact of the wrong delimiter rather than the number of books: row 10, for instance, is a fragment of a dedication, not a book. The header shows the intended variables: title, author, date, publisher, identifier, series, subseries, the number of chapters, words, and characters (nchap, nword, nchar), and the author’s dedication.

Specifying Delimiter

startrek_data <- read_delim(file="../../data/star_trek_books.csv",delim =";")

Rows: 783 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (7): title, author, publisher, identifier, series, subseries, dedication
dbl  (3): nchap, nword, nchar
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

With a semicolon (;) delimiter the file reads correctly as 783 rows and 11 columns, with date parsed as a date and nchap, nword, and nchar as numbers.

Dataset - startrek_data

startrek_data

# A tibble: 783 × 11
   title    author date       publisher identifier series subseries nchap  nword
   <chr>    <chr>  <date>     <chr>     <chr>      <chr>  <chr>     <dbl>  <dbl>
 1 Star Tr… Alan … 2009-05-12 Simon an… 1439163391 AV     <NA>         18  77035
 2 Starfle… Rick … 2010-11-02 Simon Sp… 978144241… AV     Starflee…    14  40129
 3 Starfle… Rudy … 2010-12-28 Simon Sp… 978144241… AV     Starflee…    31  52547
 4 Starfle… Rick … 2011-06-28 Simon Sp… 978144241… AV     Starflee…    13  39495
 5 Starfle… Alan … 2012-06-26 Simon Sp… 978144242… AV     Starflee…    30  62030
 6 Star Tr… Alan … 2013-05-21 Gallery … 978147671… AV     <NA>         17  77438
 7 Captain… James… 1998-06-01 Pocket B… 978143910… CT     <NA>         21  95110
 8 Captain… Macke… 1998-10-01 Pocket B… 978074345… CT     <NA>         26  76392
 9 Captain… Chris… 1998-10-01 Pocket B… 978143910… CT     <NA>         34  78678
10 The Cap… John … 2000-03-01 Pocket B… 978074340… CT     <NA>        176 436682
# ℹ 773 more rows
# ℹ 2 more variables: nchar <dbl>, dedication <chr>

First 10 rows of the Star Trek Book dataset

startrek_data %>% 
 head(10)

# A tibble: 10 × 11
   title    author date       publisher identifier series subseries nchap  nword
   <chr>    <chr>  <date>     <chr>     <chr>      <chr>  <chr>     <dbl>  <dbl>
 1 Star Tr… Alan … 2009-05-12 Simon an… 1439163391 AV     <NA>         18  77035
 2 Starfle… Rick … 2010-11-02 Simon Sp… 978144241… AV     Starflee…    14  40129
 3 Starfle… Rudy … 2010-12-28 Simon Sp… 978144241… AV     Starflee…    31  52547
 4 Starfle… Rick … 2011-06-28 Simon Sp… 978144241… AV     Starflee…    13  39495
 5 Starfle… Alan … 2012-06-26 Simon Sp… 978144242… AV     Starflee…    30  62030
 6 Star Tr… Alan … 2013-05-21 Gallery … 978147671… AV     <NA>         17  77438
 7 Captain… James… 1998-06-01 Pocket B… 978143910… CT     <NA>         21  95110
 8 Captain… Macke… 1998-10-01 Pocket B… 978074345… CT     <NA>         26  76392
 9 Captain… Chris… 1998-10-01 Pocket B… 978143910… CT     <NA>         34  78678
10 The Cap… John … 2000-03-01 Pocket B… 978074340… CT     <NA>        176 436682
# ℹ 2 more variables: nchar <dbl>, dedication <chr>

The first 10 rows of the Star Trek Books dataset display information about books published by various publishers, such as Simon and Schuster, Simon Spotlight, Gallery Books, and Pocket Books. The dataset shows identifiers and indicates the series the books belong to, like “AV” or “CT.” Some entries also belong to subseries, such as “Starfleet Academy.” Additionally, the dataset provides details on the number of chapters (nchap) and the total word count (nword) for each book.

Glimpse - startrek_data

glimpse(startrek_data)

Rows: 783
Columns: 11
$ title      <chr> "Star Trek: Star Trek Movie Tie-In", "Starfleet Academy: Th…
$ author     <chr> "Alan Dean Foster", "Rick Barba", "Rudy Josephs", "Rick Bar…
$ date       <date> 2009-05-12, 2010-11-02, 2010-12-28, 2011-06-28, 2012-06-26…
$ publisher  <chr> "Simon and Schuster", "Simon Spotlight", "Simon Spotlight",…
$ identifier <chr> "1439163391", "9781442414259", "9781442414242", "9781442414…
$ series     <chr> "AV", "AV", "AV", "AV", "AV", "AV", "CT", "CT", "CT", "CT",…
$ subseries  <chr> NA, "Starfleet Academy", "Starfleet Academy", "Starfleet Ac…
$ nchap      <dbl> 18, 14, 31, 13, 30, 17, 21, 26, 34, 176, 9, 12, 36, 23, 44,…
$ nword      <dbl> 77035, 40129, 52547, 39495, 62030, 77438, 95110, 76392, 786…
$ nchar      <dbl> 460097, 238567, 295829, 233095, 349595, 537472, 554915, 424…
$ dedication <chr> "For Bjo and John TrimbleBecause hospitality is forever and…

This provides a glimpse into the 783 rows and 11 columns, summarizing various characteristics of Star Trek books.

Inspect - startrek_data

inspect(startrek_data)


categorical variables:  
        name     class levels   n missing
1      title character    781 783       0
2     author character    277 783       0
3  publisher character     21 772      11
4 identifier character    783 783       0
5     series character     28 783       0
6  subseries character     15  56     727
7 dedication character    372 372     411
                                   distribution
1 Kobayashi Maru (0.3%), Warped (0.3%) ...     
2 Peter David (4.9%) ...                       
3 Pocket Books (67.4%) ...                     
4  (%) ...                                     
5 TOS (26.8%), TNG (18.6%), SCE (10.7%) ...    
6 Typhon Pact (16.1%) ...                      
7  (%) ...                                     

Date variables:  
  name class      first       last min_diff max_diff   n missing
1 date  Date 1967-01-01 2017-11-28   0 days 485 days 783       0

quantitative variables:  
   name   class  min     Q1 median       Q3     max         mean           sd
1 nchap numeric    1     13     21     29.0     373     24.58816     21.61247
2 nword numeric  782  52500  70730  90994.5  687175  76190.07535  52453.34633
3 nchar numeric 4337 310520 415964 555866.5 4484069 461822.36271 326062.44928
    n missing
1 760      23
2 783       0
3 783       0

The inspection of the Star Trek book dataset breaks the variables down into categorical, date, and quantitative groups. The categorical variables are title, author, publisher, identifier, series, subseries, and dedication; all hold character data, with subseries (727 missing) and dedication (411 missing) having the most missing values. The single date variable records publication dates from 1967-01-01 to 2017-11-28, a span of just over 50 years. The quantitative variables are nchap (number of chapters), nword (number of words), and nchar (number of characters). The average number of chapters per book is approximately 24.59, the average word count is around 76,190, and the average character count is around 461,822. All three are right-skewed: each mean sits above its median (21 chapters, 70,730 words, 415,964 characters), and the maxima (373 chapters, 687,175 words) are far above the third quartiles.

Skim - startrek_data

skim(startrek_data)

Data summary
Name	startrek_data
Number of rows	783
Number of columns	11
_______________________
Column type frequency:
character	7
Date	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
title	0	1.00	4	58	781
author	0	1.00	2	138	277
publisher	11	0.99	7	26	21
identifier	0	1.00	10	41	783
series	0	1.00	2	6	28
subseries	727	0.07	4	23	15
dedication	411	0.48	98	97953	372

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	1967-01-01	2017-11-28	2001-12-14	577

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
nchap	23	0.97	24.59	21.61	1	13	21	29.0	373	▇▁▁▁▁
nword	0	1.00	76190.08	52453.35	782	52500	70730	90994.5	687175	▇▁▁▁▁
nchar	0	1.00	461822.36	326062.45	4337	310520	415964	555866.5	4484069	▇▁▁▁▁

The categorical variables like title, author, and publisher are mostly complete, although subseries has significant missing values (727 missing entries) and dedication has 411 missing entries. The date column spans from 1967 to 2017 with a median date around December 14, 2001. For numeric variables, nchap (number of chapters) has 23 missing values, with an average of around 25 chapters per book. The nword (number of words) and nchar (number of characters) columns are complete, showing an average of 76,190 words and 461,822 characters per book.
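The missing-value counts reported by skim can also be computed directly with dplyr. The sketch below uses a hypothetical three-row table (`books`, with made-up values) rather than the real dataset, so the counts it prints are illustrative only.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature table; the values are made up for illustration
books <- tibble(
  title     = c("Book A", "Book B", "Book C"),
  subseries = c(NA, "Typhon Pact", NA),
  nchap     = c(18, NA, 31)
)

# Count the missing values in every column
missing_counts <- books %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

missing_counts
```

The same `summarize(across(...))` call should work on `startrek_data` unchanged, since it does not depend on the column names.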

Data Dictionary

Quantitative Data

nchap (dbl): The number of chapters in the book.
nword (dbl): The number of words in the book.
nchar (dbl): The number of characters (including spaces and punctuation) in the book.
date (date): The publication date of the book.

Qualitative Data

title: The title of the book.
author: The name of the author who wrote the book.
publisher: The publishing company responsible for releasing the book.
identifier: A unique identifier for the book.
series: The main series of the Star Trek universe to which the book belongs, stored as an abbreviation (e.g., TOS for The Original Series, TNG for The Next Generation).
subseries: A subcategory or subseries within the main Star Trek series (e.g., Starfleet Academy, Typhon Pact).
dedication: A text string containing the book’s dedication, if applicable.

Summary of Average Number of Characters and Book Count by Author

startrek_data %>% 
  group_by(author) %>%
  summarize(average_characters = mean(nchar, na.rm = TRUE), count = n())

# A tibble: 277 × 3
   author                                               average_characters count
   <chr>                                                             <dbl> <int>
 1 A. C. Crispin                                                   571858      1
 2 A.C. Crispin                                                    480611.     3
 3 Aaron Rosenberg                                                  87922.     3
 4 Adam “mojo” Lebowitz; Robert Bonchune; Jonathan Lan…            144868      1
 5 Alan Dean Foster                                                498784.     2
 6 Alan Gratz                                                      349595      1
 7 Allyn Gibson                                                    144935      1
 8 Andrew J. Robinson                                              641696      1
 9 Andy Mangels                                                    576557      1
10 Andy Mangels and Michael A. Martin                              571987      1
# ℹ 267 more rows

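This gives insight into the volume of work and average text length produced by different authors in the dataset. Author names are not spelled consistently, however: “A. C. Crispin” and “A.C. Crispin” appear as separate rows, and co-authored books form their own groups, so the per-author counts can understate an individual author's output.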
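One way to merge spelling variants such as “A. C. Crispin” and “A.C. Crispin” is to normalize the spacing between initials before grouping. This is a minimal sketch under stated assumptions: the `books` table and its `nchar` values are hypothetical, `author_clean` is a name introduced here, and the regular expression only handles the initials pattern shown, not every inconsistency that may exist in the real data.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical rows; the character counts are made up for illustration
books <- tibble(
  author = c("A. C. Crispin", "A.C. Crispin", "A.C. Crispin", "Peter David"),
  nchar  = c(500, 300, 400, 600)
)

by_author <- books %>%
  # Remove the space after an initial when another initial follows,
  # so "A. C. Crispin" becomes "A.C. Crispin"
  mutate(author_clean = str_replace_all(author, "\\. (?=[A-Z]\\.)", ".")) %>%
  group_by(author_clean) %>%
  summarize(average_characters = mean(nchar, na.rm = TRUE), count = n())

by_author
```

After cleaning, both spellings fall into one group, so the count and average describe all of that author's books together.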

Summary of Average Number of Words and Book Count by Publisher

startrek_data %>% 
  group_by(publisher) %>%
  summarize(average_words = mean(nword, na.rm = TRUE), count = n())

# A tibble: 22 × 3
   publisher           average_words count
   <chr>                       <dbl> <int>
 1 Abrams Publications       132041      1
 2 Aladdin                    23801.    28
 3 Aladdin Paperbacks         69076      1
 4 Amereon Ltd                40804.     2
 5 Bantam Books               50225.    20
 6 Demco Media                23310.     2
 7 Dk Publishing             119459      1
 8 Elysium                    98779      1
 9 Gallery Books              93418.    15
10 Harlequin                  63453      1
# ℹ 12 more rows

Grouping the Star Trek book dataset by publisher gives the average number of words and the total book count for each publisher. For example, Abrams Publications has a single book of 132,041 words, while Aladdin has 28 books averaging about 23,801 words. The table has 22 rows because the 11 books with no recorded publisher form their own group alongside the 21 named publishers.

Summary of Average Word Count and Total Books by Series

startrek_data %>%
  group_by(series) %>%
  summarize(average_words = mean(nword, na.rm = TRUE), count = n())

# A tibble: 28 × 3
   series average_words count
   <chr>          <dbl> <int>
 1 AV            58112.     6
 2 CT           157688.     5
 3 DS9           91641.    83
 4 DSC           89367      1
 5 DTI           59215.     5
 6 ENT           82724.    19
 7 KE            74668.     4
 8 MIR          115916.     5
 9 MISC          89974.    13
10 MYR          145127      3
# ℹ 18 more rows

Grouping by series gives each series' average word count and total number of books. For example, the “TOS” series has 210 books with an average word count of 76,522.67, while the “CT” series has only 5 books but a much higher average of 157,687.80. That CT average is driven largely by a single 436,682-word book; the other four CT books range from 76,392 to 101,577 words. Average length also varies widely across series, with “YA-DS9” averaging only 23,298 words compared with 145,127 for “MYR”. This summary shows how the series compare in both length and volume.
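Because the printed tibble shows only the first ten series alphabetically, sorting by count brings the largest series to the top, and adding the median shows how much a single long book pulls the mean. The sketch below uses a hypothetical `books` table with made-up word counts; `median_words` is a column introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical rows; word counts are made up for illustration
books <- tibble(
  series = c("TOS", "TOS", "TOS", "TOS", "CT", "CT", "CT", "MYR"),
  nword  = c(70000, 80000, 75000, 75000, 90000, 80000, 430000, 145000)
)

by_series <- books %>%
  group_by(series) %>%
  summarize(
    average_words = mean(nword, na.rm = TRUE),
    median_words  = median(nword, na.rm = TRUE),  # less sensitive to one very long book
    count         = n()
  ) %>%
  arrange(desc(count))  # largest series first

by_series
```

In this toy data the CT mean is far above its median because of the one 430,000-word entry, which mirrors the pattern described for the real CT series.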

Summary of Total Books and Average Word Count by Series and Author

startrek_data %>%
  group_by(series, author) %>%
  summarize(total_books = n(), average_words = mean(nword, na.rm = TRUE))

`summarise()` has grouped output by 'series'. You can override using the
`.groups` argument.

# A tibble: 419 × 4
# Groups:   series [28]
   series author                                total_books average_words
   <chr>  <chr>                                       <int>         <dbl>
 1 AV     Alan Dean Foster                                2        77236.
 2 AV     Alan Gratz                                      1        62030 
 3 AV     Rick Barba                                      2        39812 
 4 AV     Rudy Josephs                                    1        52547 
 5 CT     Christopher Pike                                1        78678 
 6 CT     James T. Hirk                                   1        95110 
 7 CT     John J. Ordover and Dean Wesley Smith           1       436682 
 8 CT     Keith R.A. Decandido                            1       101577 
 9 CT     Mackenzie                                       1        76392 
10 DS9    Andrew J. Robinson                              1       113304 
# ℹ 409 more rows

This allows for an analysis of how different authors contribute to various Star Trek series in terms of the number of books they have written and the average length (in words) of their works.
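The message printed above notes that the result is still grouped by series and that the `.groups` argument can override this. A minimal sketch of that option follows, using three rows whose series, author, and word-count values are copied from the output shown earlier; the table name `books` is introduced here.

```r
library(dplyr)
library(tibble)

# Three rows taken from the printed output above
books <- tibble(
  series = c("AV", "AV", "CT"),
  author = c("Rick Barba", "Rick Barba", "Christopher Pike"),
  nword  = c(40129, 39495, 78678)
)

by_series_author <- books %>%
  group_by(series, author) %>%
  summarize(
    total_books   = n(),
    average_words = mean(nword, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble and suppress the message
  )

is_grouped_df(by_series_author)
```

Dropping the groups is usually the safer default when the summary will be sorted or filtered afterwards, because later verbs then act on the whole table rather than within each series.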

Summary of Total Books and Average Character Count by Year and Author

startrek_data %>%
  group_by(year(date), author) %>%
  summarize(total_books = n(), average_characters = mean(nchar, na.rm = TRUE))

`summarise()` has grouped output by 'year(date)'. You can override using the
`.groups` argument.

# A tibble: 624 × 4
# Groups:   year(date) [51]
   `year(date)` author         total_books average_characters
          <dbl> <chr>                <int>              <dbl>
 1         1967 James Blish              1            235524 
 2         1968 James Blish              1            232094 
 3         1969 James Blish              1            224369 
 4         1970 James Blish              1            207001 
 5         1971 James Blish              1            239859 
 6         1972 James Blish              4            268216.
 7         1973 James Blish              1            314575 
 8         1974 James Blish              1            310749 
 9         1975 James Blish              1            332807 
10         1976 Sondra Marshak           1            431759 
# ℹ 614 more rows

This helps analyze trends over time, showing how many books each author contributed in a particular year and the typical length of those books in terms of characters.
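Grouping on the bare expression `year(date)` produces a column literally named `year(date)`, which then has to be quoted with backticks. Naming the expression inside `group_by()` avoids that. The sketch below assumes a hypothetical `books` table with made-up dates and character counts; `pub_year` is a name chosen here for illustration.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical rows; dates and character counts are made up for illustration
books <- tibble(
  author = c("James Blish", "James Blish", "Sondra Marshak"),
  date   = as.Date(c("1972-02-01", "1972-11-01", "1976-03-01")),
  nchar  = c(260000, 270000, 431759)
)

by_year_author <- books %>%
  group_by(pub_year = year(date), author) %>%  # name the derived grouping column
  summarize(
    total_books        = n(),
    average_characters = mean(nchar, na.rm = TRUE),
    .groups = "drop"
  )

names(by_year_author)
```

With a plain column name, later steps such as `arrange(pub_year)` or plotting against the year need no special quoting.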

Summary of Total Books and Average Character Count by Publisher and Series

startrek_data %>%
  group_by(publisher, series) %>%
  summarize(total_books = n(), average_nchar = mean(nchar, na.rm = TRUE))

`summarise()` has grouped output by 'publisher'. You can override using the
`.groups` argument.

# A tibble: 81 × 4
# Groups:   publisher [22]
   publisher           series total_books average_nchar
   <chr>               <chr>        <int>         <dbl>
 1 Abrams Publications REF              1       898671 
 2 Aladdin             YA-DS9          11       135368.
 3 Aladdin             YA-TNG          14       141448.
 4 Aladdin             YA-TOS           2       147656.
 5 Aladdin             YA-VOY           1       130989 
 6 Aladdin Paperbacks  TOS              1       390799 
 7 Amereon Ltd         TOS              2       233809 
 8 Bantam Books        TOS             20       287111.
 9 Demco Media         YA-VOY           2       134614 
10 Dk Publishing       REF              1      1094928 
# ℹ 71 more rows

This shows how much each publisher contributed to each Star Trek series and the typical length of the books it published there, so output volume and book length can be compared across publisher and series combinations. For example, Aladdin appears only under the YA series, with average lengths of roughly 131,000 to 148,000 characters, while Bantam Books published 20 TOS books averaging about 287,111 characters.

Summary of Average Word Count and Total Books by Subseries

startrek_data %>%
  group_by(subseries) %>%
  summarize(average_words = mean(nword, na.rm = TRUE), count = n())

# A tibble: 16 × 3
   subseries               average_words count
   <chr>                           <dbl> <int>
 1 Academy                       106196      1
 2 Dark Passions                  52072.     2
 3 Day of Honor                  116344.     6
 4 Destiny                       147208      4
 5 Dominion War                   70683.     5
 6 Gateways                      116709      1
 7 Mirror Universe Trilogy        95785.     3
 8 Prey                           97729      3
 9 Section 31                     75524.     6
10 Starfleet Academy              48550.     4
11 The Badlands                   60314.     2
12 The Brave and the Bold         66966      2
13 The Fall                       95050.     5
14 Totality                       80817.     3
15 Typhon Pact                   124866.     9
16 <NA>                           74781.   727

Grouping by subseries gives the average word count and total number of books for each subseries. For example, the “Academy” subseries has 1 book of 106,196 words, while “Day of Honor” spans 6 books with an average of about 116,344 words. The “Destiny” subseries consists of 4 books averaging 147,208 words, and the “Typhon Pact” subseries has 9 books with an average of about 124,866 words. Some subseries, like “Prey,” with 3 books averaging 97,729 words, are moderately sized collections. The final NA row accounts for the 727 books with no subseries recorded, which average about 74,781 words; these are the large majority of the 783 books, so only 56 books belong to a named subseries.
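The NA group can be given an explicit label before summarizing, which makes the table easier to read and keeps those books from being mistaken for a data error. This is a minimal sketch assuming a hypothetical `books` table with made-up word counts; the label "(no subseries)" is a choice made here, not something defined by the dataset.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical rows; word counts are made up for illustration
books <- tibble(
  subseries = c("Destiny", "Destiny", NA, NA, "Prey"),
  nword     = c(150000, 144000, 70000, 80000, 97729)
)

by_subseries <- books %>%
  # Replace missing subseries with an explicit label before grouping
  mutate(subseries = replace_na(subseries, "(no subseries)")) %>%
  group_by(subseries) %>%
  summarize(average_words = mean(nword, na.rm = TRUE), count = n())

by_subseries
```

If the comparison should cover named subseries only, `filter(!is.na(subseries))` in place of the `mutate()` step would exclude those rows instead of relabeling them.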